 parallel corpora


Automated Snippet-Alignment Data Augmentation for Code Translation

Zhang, Zhiming, Zhu, Qingfu, Luo, Xianzhen, Wang, Yixuan, Li, Bohan, Che, Wanxiang

arXiv.org Artificial Intelligence

Code translation aims to translate the code from its source language to the target language and is used in various software development scenarios. Recent developments in Large Language Models (LLMs) have showcased their capabilities in code translation, and parallel corpora play a crucial role in training models for code translation. Parallel corpora can be categorized into program-alignment (PA) and snippet-alignment (SA) data. Although PA data has complete context and is suitable for semantic alignment learning, it may not provide adequate fine-grained training signals due to its extended length, while the brevity of SA data enables more fine-grained alignment learning. Due to limited parallel corpora, researchers explore several augmentation methods for code translation. Previous studies mainly focus on augmenting PA data. In this paper, we propose a data augmentation method that leverages LLMs to generate SA data automatically. To fully leverage both PA data and SA data, we explore a simple yet effective two-stage training strategy, which consistently enhances model performance compared to fine-tuning solely on PA data. Experiments on TransCoder-test demonstrate that our augmented SA data combined with the two-stage training approach yields consistent improvements over the baseline, achieving a maximum gain of 3.78% on pass@k.
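The abstract reports gains in pass@k without defining the metric. As a minimal sketch, the following shows one commonly used unbiased estimator of pass@k, computed from n sampled translations of which c pass all tests. Whether the paper uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which pass all unit tests.

    Uses the combinatorial form 1 - C(n - c, k) / C(n, k): the probability
    that a random size-k subset of the n samples contains at least one
    passing sample. This is an assumed definition, not taken from the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a benchmark, the per-problem estimates would typically be averaged over all problems.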


CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

Bhattacharjee, Soham, Roy, Mukund K, Poojary, Yathish, Dave, Bhargav, Raj, Mihir, Mujadia, Vandan, Gain, Baban, Mishra, Pruthwik, Ahsan, Arafat, Krishnamurthy, Parameswari, Rao, Ashwath, Josan, Gurpreet Singh, Dubey, Preeti, Kak, Aadil Amin, Kulkarni, Anna Rao, VG, Narendra, Arora, Sunita, Balbantray, Rakesh, Majumdar, Prasenjit, Arora, Karunesh K, Ekbal, Asif, Sharma, Dipti Mishra

arXiv.org Artificial Intelligence

India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages (English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati), comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.


CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Zhuang, Wenhao, Sun, Yuan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source the CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.


UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations

Lu, Qiuyang, Shen, Fangjian, Tang, Zhengkai, Liu, Qiang, Cheng, Hexuan, Liu, Hui, Wen, Wushao

arXiv.org Artificial Intelligence

The quality and accessibility of multilingual datasets are crucial for advancing machine translation. However, previous corpora built from United Nations documents have suffered from issues such as an opaque construction process, difficulty of reproduction, and limited scale. To address these challenges, we introduce a complete end-to-end solution, from data acquisition via web scraping to text alignment. The entire process is fully reproducible, with a minimalist single-machine example and optional distributed computing steps for scalability. At its core, we propose a new Graph-Aided Paragraph Alignment (GAPA) algorithm for efficient and flexible paragraph-level alignment. The resulting corpus contains over 713 million English tokens, more than doubling the scale of prior work. To the best of our knowledge, this represents the largest publicly available parallel corpus composed entirely of human-translated, non-AI-generated content. Our code and corpus are accessible under the MIT License.


The TUB Sign Language Corpus Collection

Avramidis, Eleftherios, Czehmann, Vera, Deckert, Fabian, Hufe, Lorenz, Lipski, Aljoscha, Villalobos, Yuni Amaloa Quintero, Rhee, Tae Kwon, Shi, Mengqian, Stölting, Lennart, Nunnari, Fabrizio, Möller, Sebastian

arXiv.org Artificial Intelligence

We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3 M subtitles containing 14 M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.


Building and Aligning Comparable Corpora

Saad, Motaz, Langlois, David, Smaili, Kamel

arXiv.org Artificial Intelligence

A comparable corpus is a set of topic-aligned documents in multiple languages, which are not necessarily translations of each other. These documents are useful for multilingual natural language processing when there is no parallel text available in some domains or languages. In addition, comparable documents are informative because they can tell what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two such measures: the first is based on a bilingual dictionary, and the second is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and from the ALJAZEERA (JSC) news website respectively. Then we use the CL-LSI similarity measure to automatically align comparable documents of BBC and JSC. The evaluation of the alignment shows that CL-LSI is able to align cross-lingual documents not only at the topic level but also at the event level.
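To make the dictionary-based measure concrete, here is a minimal sketch of one way such a measure could work: source tokens are mapped through a bilingual dictionary, and each source document is aligned to the target document with the highest bag-of-words cosine similarity. This is an illustration under stated assumptions, not the authors' exact measure; the function names and the toy French-English dictionary are hypothetical.

```python
from collections import Counter
from math import sqrt


def translate_tokens(tokens, dictionary):
    """Map each source token to its dictionary translations; drop unknown tokens."""
    translated = []
    for token in tokens:
        translated.extend(dictionary.get(token, []))
    return translated


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(source_docs, target_docs, dictionary):
    """For each source document, return (index of best target document, score)."""
    if not target_docs:
        raise ValueError("target_docs must not be empty")
    target_bags = [Counter(doc.lower().split()) for doc in target_docs]
    results = []
    for doc in source_docs:
        bag = Counter(translate_tokens(doc.lower().split(), dictionary))
        scores = [cosine(bag, target_bag) for target_bag in target_bags]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((best, scores[best]))
    return results


# Toy data (hypothetical, for illustration only).
toy_dictionary = {
    "élection": ["election"],
    "président": ["president"],
    "match": ["match", "game"],
    "football": ["football"],
}
french_docs = ["élection président", "match football"]
english_docs = ["football game tonight", "president wins election"]
alignment = align(french_docs, english_docs, toy_dictionary)
```

A dictionary-based measure of this kind depends on dictionary coverage, which is one plausible reason a latent-space measure such as CL-LSI can perform better.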


Edeflip: Supervised Word Translation between English and Yoruba

Abioye, Ikeoluwa, Ge, Jiani

arXiv.org Artificial Intelligence

In recent years, embedding alignment has become the state-of-the-art machine translation approach, as it can yield high-quality translation without training on parallel corpora. However, existing research and application of embedding alignment mostly focus on high-resource languages with high-quality monolingual embeddings. It is unclear if and how low-resource languages may be similarly benefited. In this study, we implement an established supervised embedding alignment method for word translation from English to Yoruba, the latter a low-resource language. We found that higher embedding quality and normalizing embeddings increase word translation precision, with, additionally, an interaction effect between the two. Our results demonstrate the limitations of the state-of-the-art supervised embedding alignment when it comes to low-resource languages, for which there are additional factors that need to be taken into consideration, such as the importance of curating high-quality monolingual embeddings. We hope our work will be a starting point for further machine translation research that takes into account the challenges that low-resource languages face.


Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Raja, Rahul, Vats, Arpita

arXiv.org Artificial Intelligence

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora in terms of criteria such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.


Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Fernando, Aloka, Ranathunga, Surangika, de Silva, Nisansa

arXiv.org Artificial Intelligence

Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems with the top-ranked samples, yields better NMT performance than training on the full dataset. However, previous research has also shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En→Si, En→Ta and Si→Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This improves the results of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
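The ranking step described in this abstract can be sketched as follows: score each sentence pair by the cosine similarity of its two sentence embeddings, sort, and keep the top-ranked fraction. The embeddings are assumed to be precomputed by some multilingual encoder and passed in as plain lists of floats; the function names and the `keep_fraction` parameter are hypothetical, and the paper's debiasing heuristics are not reproduced here.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_ranked_pairs(pairs, src_vecs, tgt_vecs, keep_fraction=0.5):
    """Rank sentence pairs by embedding similarity and keep the top fraction.

    pairs:     list of (source_sentence, target_sentence) tuples
    src_vecs:  precomputed embeddings of the source sentences (assumed input)
    tgt_vecs:  precomputed embeddings of the target sentences (assumed input)
    """
    if not (len(pairs) == len(src_vecs) == len(tgt_vecs)):
        raise ValueError("pairs and embedding lists must have the same length")
    scored = [(cosine(s, t), pair) for pair, s, t in zip(pairs, src_vecs, tgt_vecs)]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction)) if scored else 0
    return [pair for _, pair in scored[:keep]]
```

Because the scores come from a particular encoder, any bias of that encoder towards certain sentence types carries over into the ranking, which is the problem the abstract's heuristics aim to reduce.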


Comparable Corpora: Opportunities for New Research Directions

Church, Kenneth

arXiv.org Artificial Intelligence

Most conference papers present new results, but this paper will focus more on opportunities for the audience to make their own contributions. This paper is intended to challenge the community to think more broadly about what we can do with comparable corpora. We will start with a review of the history, and then suggest new directions for future research. This was a keynote at BUCC-2025, a workshop associated with Coling-2025.